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Linear discrimination, from the point of view of numerical linear algebra, can be treated 
as solving an ill-posed system of linear equations. In order to generate a solution that is 
robust in the presence of noise, these problems require regularization. Here, we examine 
the ill-posedness involved in the linear discrimination of cancer gene expression data with 
respect to outcome and tumor subclasses. We show that a filter factor representation, 
based upon Singular Value Decomposition, yields insight into the numerical ill-posedness 
of the hyperplane-based separation when applied to gene expression data. We also show 
that this representation yields useful diagnostic tools for guiding the selection of classifier 
parameters, thus leading to improved performance. 
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1. Introduction 

A current challenge in cancer treatment is to target specific therapies to distinct 
tumor types and to tailor the intensity of the treatment to the risk of relapse for 
each patient 8 ' 32 ' 33 . Crucial to this effort is to effectively classify patients into specific 
risk groups. Traditionally, many risk stratification schemes use tumor morphology, 
molecular genetics and cytogenetics in addition to utilizing information such as race, 
age, etc. 35 Clinically, however, this approach has limitations. Tumors with similar 
appearances may have different clinical courses and display different responses to 
therapy. In addition, conventional laboratory diagnostic procedures do not reveal 
the full underlying molecular heterogeneity of these tumors. For this reason, there 
has been great interest in using gene arrays to identify more precisely known tumor 
subclasses, to discover new tumor subclasses, and to predict outcome, i.e. whether 
or not a patient's cancer will go into remission. 

Our goal in this work is to investigate the classification of cancer patient samples 
according to tumor lineage and outcome via hyperplane classifiers, i.e. we attempt 
to linearly separate with a hyperplane patient samples via their expression pro- 
files. This is numerically challenging since gene expression data is characterized 
by a noisy, high-dimensional, low-sample-size setting. Hyperplane classifiers cope 
with this challenging setting by incorporating prior knowledge assumptions that 
constrain the solution (the hyperplane parameters) in some way. The term regu- 
larization refers to this incorporation of prior information in order to stabilize the 
problem and to sift out a desired solution. With respect to hyperplane classifiers, 
there are many choices but each one encodes a different strategy for how to stabilize 
the solution, and, in this paper, we focus on the filter factor representation using 
singular value decomposition (SVD) as the particular regularization strategy. While 
some researchers have used SVD-based approaches 14 ' 27 for cancer classification, to 
the best of our knowledge, there has been no work on differentiating these methods 
on the basis of a filter factor representation. In addition, in seeking to evaluate why 
a certain regularization strategy should be preferred over another in the context of 
gene expression data, the type of numerical ill-posedness (rank- deficiency or dis- 
crete ill-posedness) exhibited by expression data ought to be taken into account. To 
date, this has not been done. Moreover, the filter factor representation involves the 
use of spectral coefficients that can be used to estimate how much regularization 
should be applied. 

In Section 2, the mathematical notation used in subsequent sections will be 
detailed. In Section 3, we discuss kernel versions of regression-based hyperplane 
classifiers and the regularization approaches that these hyperplane classifiers can 
adopt when using a SVD-based filter factor representation. In Section 4, we review 
the cancer expression data sets used in our analysis and the methods used for data 
preprocessing and parameter estimation. Section 5 presents the classification results 
for hyperplane classifiers across a variety of regularization schemes and describes 
how spectral coefficients such as the singular values and Fourier coefficients can be 
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used to tune classifier parameters. Finally, we present the conclusion in 6. 
2. Notation 

The matrix X = [x\, . . . , x n ] £ M. dxn denotes the gene expression data derived from 
n patient tissue samples with each patient expression vector Xi £ M. d consisting of 
d gene attributes. The vector y £ M. n consists of n binary-valued class labels where 
iji = {— 1, +1} indicates membership in cither the negative or positive class. With 
respect to acute leukemias, for example, one might assign the class label yi = — 1 
or yi = +1 to indicate membership in either ALL (acute lymphoblastic leukemia) 
or AML (acute myeloid leukemia), respectively. The matrix Y is an n x n diagonal 
matrix such that Y — diag(y) = diag(yi, . . . , y n ) and Y 2 = I n . The vectors 1„ and 
n represent column vectors of n ones and zeros, respectively. The transpose of a 
matrix or vector is denoted by the superscript T, while || • ||2 denotes the two-norm 
of a vector. The superscript p will refer to a specific partition of patient samples 
into a training and test set (this will be explained further in §5). 

Given X and y, one attempts to learn the coefficients, w £ R d and wq, that 
determine a separating hyperplane given by the following mathematical expression: 
g(x) = x t w+wq = 9 . Geometrically, different choices ofw and wq typically yield 
different hyperplanes since w determines the tilt of the hyperplane, while wq is the 
bias, or offset from the origin. To simplify notation, the data is often augmented to 
incorporate w$: w = [w T , wq) t £ M. d+1 and x = [x T , 1] T £ M. d+1 denote augmented 
weight and gene expression vectors, respectively. Hence, g{x) = x T w + wo can be 
simply expressed as g(x) = x T w. Similarly, X = [X T ,l n ] T £ R( <i + 1 ) xn is the 
augmented data for the entire gene expression matrix. This augmentation adds a 
constant feature to the training data such that the separating hyperplane passes 
through the origin in R d+1 . The algorithm that determines the optimal hyperplane 
parameters w will be referred to as a hyperplane classifier, and the data X and y 
used to determine w is referred to as the training data. Once w is obtained, the 
formal classification task is to assign class membership to a new input z £ M. d : if 
g{z) < (g(z) > 0), then z is assigned to the negative (positive) class. If g(z) = 0, 
then the class membership of z is indeterminate. If Z = [z\, . . . , z^] £ R^ d+1 ') xN 
consists of TV augmented test set expression profiles distinct from the training set 
and t £ M. N is its corresponding set of class labels such that t; = {— 1, +1}, then 
the number of misclassifications M can be computed as follows: 



where /[■] is the indicator function such that I[9] = 1 if 8 < and I [6] = if 8 > 0. 

3. Filter Factor Representations of Hyperplane Classifiers 

A hyperplane classifier, in its simplest form, specifies a linear relationship between 
a response variable, y, and a set of predictor variables, X. For many problems, 
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estimates of the linear relationships between variables are adequate to describe 
the observed data and to make reasonable predictions for new observations. This 
has been well-established in the case of cancer classification using gene expression 
data. 14 ' 23 ' 24,27 However, since the number of samples (n) is much less than the num- 
ber of genes (d), the samples sparsely populate a very high dimensional gene space 
and, as a result, there is a strong likelihood of finding many perfectly separating 
hyperplanes for the training data. Another complicating factor is the presence of 
significant biological and experimental variability in the expression data. Assuming 
that a linear discrimination approach to the classification of gene expression data 
will generally suffice, we therefore focus on deciding which regularization strategy 
is appropriate for extracting a stable set of hyperplane parameters in the presence 
of noisy and high-dimensional data. 



3.1. Kernel-based regression 

Suppose that the hyperplane classifier attempts to find w by solving the regression 
problem X w = y using least squares (LS) minimization: 



- T 

X w — y 



w * — * 

1=1 



(2) 



LS methods for solving for the weight vector w (hereafter referred to as the pri- 
mal variable), such as QR factorization, involve 0(dn 2 ) floating point operations 
(flops) 10 . When d 3> n, one can reduce the computational complexity by working 
with the kernel version of Eq.(2) instead. Assuming that w can be rewritten as 
a linear combination of the training data, i.e., w = X(3, (3 G R n , (3 (hereafter 
referred to as the dual variable) can be determined instead: 
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where K = X X £ M. nxn will be referred to as the linear kernel matrix. Solving 
Eq.(3) now scales cubically with the number of patient samples n. 



3.2. Need for regularization 

If X is ill-conditioned, then the computed solution (3 will not be stable. Ill- 
conditioning is always the case with coefficient matrices derived from classification- 
based regression problems. For example, if X = [X +1 XJ\ is a data partitioning 
with respect to the positive and negative classes, then the feature or attribute pro- 
files of data points within X + and X_ will be highly correlated because it is this 
commonality which defines the class. As a result, the effective numerical rank r eff 
(the number of columns of X that are linearly independent with respect to some 
error level) will be small due to the multicollinearity of the columns within X + 
and X_. In the case of gene expression data, r ob3 can vary significantly due to the 
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large amount of within-class and between-class biological variability. In addition, 
the large amount of noise associated with gene arrays can artificially inflate r off such 
that the observed rank r ob3 will be greater than r eff . In general, r obB = n (full rank) 
but r off < r obs . 

The phenomenon of solution instability can be illustrated by expressing the 
ordinary least squares (OLS) solution to Eq.(3) in terms of the singular value de- 
composition (SVD) of X 10 > 28 . IfX= UY.V T is the SVD of X, where 

U = [ui,...,u n ] 6R( d+1 ' x ", V=[vi,...,v n ]eR nXn , 
U T U = V T V = I n , 
£ = diag(<7i, . . . ,cr„), <ri > ■■ • > cr„ > 0, 

then the solution (3 can be given in terms of X and V: 

(3 = K'y = V^ 2 V T y = W^ (4) 

i=l V °i ' 

Since we assumed X to have full rank (even though it is ill-conditioned), we use 
K 1 instead of replacing it with the pseudoinversc . In Eq.(4), the terms asso- 
ciated with small singular values (i.e., terms associated with large i) correspond 
to spurious noise inherent in the data. Division by small singular values unduly 
amplifies the effect of these noise terms and has a direct analogue with overfitting: 
if we include terms associated with small singular values, then the solution will 
have low bias and high variance 21 . Numerically, this overfitting will manifest itself 
as a large solution norm for (3 since \\f3\\ 2 = Y^i=i\( v T V) I a l] 2 ■ Regularization in 
the context of the SVD expansion of Eq.(4) amounts to eliminating or damping 
noise terms associated with small singular values. This, in effect, imposes a prior 
knowledge assumption of "smoothness" on the solution since large values of \\B\\\ 
are penalized. 



3.3. Filter factor representations 

There are a variety of regularization techniques that can be used to impose solution 
smoothness. We are particularly interested in techniques that admit filter factor 
solutions 19 . A filter factor representation is a rewcighting of Eq.(4) and has the 
form 

(3 = VF^-V T y = J2 (£f) fiVi, (5) 

i=l ^ a i ' 

where F = diag(/i, . . . , /„) is a diagonal matrix consisting of the filter factors 19 
or shrinkage coefficients 21 . The filter factors typically decay to zero as i increases 
such that contributions from terms with small singular values arc filtered out. For 
the OLS solution, no damping occurs since F = I n (or f\ = ■ ■ ■ = f n = 1). How- 
ever, least squares (LS) techniques such as truncated singular value decomposition 
(TSVD), kernel ridge regression (KRR) and partial least squares (PLS) have non- 
trivial filter factor representations as a result of solving a modified LS problem 
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involving the dual variable (3. Each of these LS methods, through its respective 
filter factor representation, encodes a different regularization strategy in order to 
suppress the noisy terms in the SVD expansion of Eq. (4) . We briefly describe each 
of these methods in turn. 

3.3.1. TSVD 

For TSVD, the approach to damping noise terms in Eq.(4) is to simply eliminate 
them entirely. Keeping the first k terms amounts to having binary-valued filter 
factors such that /i = ••• = /& = 1 and fk+i = ■ ■ ■ = f n = 0. Alternatively, one 
can arrive at the same solution by solving the modified LS problem, Kk/3 = y, in 
which Kk = (JiVivf is the rank-fc approximation of K obtained by replacing 

the smallest n — k singular values with zeros. 

3.3.2. KRR 

For KRR, the modified LS problem involving f3 is derived by minimizing the fol- 

~ T 

lowing loss function, L{w) = \\X w — y||| + A 2 ||tb|||, by setting Vu,L(w) = Od+i 
and replacing w with X(3: 

V^Liw) = 2{XX T + \ 2 I d+1 )w - 2Xy = d+1 => (K + A 2 I n )0 = y. (6) 

Note, however, that the linear system in Eq.(6) involving f3 is not of the form 
typically associated with classical ridge regression or Tikhonov regularization, 

(A T A + j 2 B T B)x = A T b, (7) 

in which the matrices A £ ]R mx ™ anc j B £ M. nxp (m > n > p) are not assumed to 
be symmetric positive definite (SPD). Instead, {K + X 2 I n )f3 = y corresponds to a 
related regularization scheme first proposed by Franklin 12 (hereafter referred to as 
the Franklin regularization scheme) in which he suggested replacing Eq.(7) with 

(A + jB)x = f>, 7 > 0, (8) 

when A £ W ixn and B £ R nxn are SPD. If B = I n and {^i, ...,ip n } are the 
singular values of A, then the filter factors are the following 19 : 

t f V'i /(^i + 7 2 )i Tikhonov regularization, 
1 I V'i/CV'i + 7)7 Franklin regularization. 

Hence, KRR, in the context of Eq.(7), is nothing more than the Franklin regu- 
larization scheme with A = K (where ipi = of), B = I n , x = (3, b = y and 
7 = A 2 . 

In the context of support vector machine (SVM) classification 9 , KRR also 
has connections with the two-norm SVM formulations, namely, Proximal SVM 
(PSVM) 13 and Lagrangian SVM (LSVM) 26 . Cast as an optimization problem, the 
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objective functions associated with these particular SVM formulations are identical, 
while the constraints differ slightly: 

1 v f Y (x T w) + £ > l n (LSVM) 

mini||A||i + ^ll€ll2 s.t. I L 1 ' (9) 

* 2" 112 2 il<S " 2 I Y (X w) + £ = 1„ (PSVM) W 



where £ € M™ is a vector of slack variables 7 and v is a parameter which controls 
the overall amount of misclassification violation 9 . If the data points were linearly 
separable, the LSVM would require that all of the data points lie outside of the 
corridor defined by g{w) = x T w < 1. Numerically, this is described by the set of 
constraints: 

X T w^j > 1„. (10) 

However, for data sets that are not linearly separable, the constraints in Eq.(10) 
are never simultaneously satisfied. To allow for the likely possibility of data points 
lying on the wrong side of the separating hyperplane (or allowing for slackness 
in the constraints), £ and v are introduced to allow for the violation of Eq.(10). 
In effect, one can view the LSVM relaxation of the inequality constraints as a 
rcgularization mechanism which imposes a "smoothness" prior on the solution w. 
The corresponding dual version of Eq.(9), obtained using the Karush-Kuhn- Tucker 
conditions of optimization theory 7 , can then be expressed, after some algebraic 
manipulation, as the following constrained LS problem, 

,2, ^ f Y/3 > 0„ (LSVM) 

solve (K + X 2 I n )(3 = y s.t. | Rfl (11) 

where A = \ jv. For the PSVM, (3 is unconstrained in Eq.(ll) and hence is equivalent 
to KRR (this was first observed by Agarwal 2 ). For the LSVM, the data points 
associated with the nonzero component values of /3 arc called the support vectors. 
Geometrically, these data points lie on or within the canonical corridor defined by 
g(x) = x T w < 1 and these points alone determine the orientation of the hyperplane 
since w = X(3 = J2p z ^o P&i- 

3.3.3. PLS 

The Krylov subspace, K m (A, b) — span{6, Ab, . . . , A m_1 6}, is often associated 
with the conjugate gradient (CG) algorithm applied to Ax = b when A is SPD 28 . 

~ T 

In solving the regression problem X w = y, Holland showed that the PLS-based 
weight vector w lies in the Krylov subspace JC m (XX ,Xy) 22 . If we express this 
Krylov subspace in terms of K and (3, we have 

IC rn {XX T , Xy) = span{±y, XKy, . . . , XK^y} = span{X/3}, (12) 
where /3 £ K, m {K ,y). As a result, the modified LS problem associated with PLS is 

K/3 = y, (3€lC m (K,y). 
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The filter factors can then be expressed in terms of the Ritz polynomial, p m (9) 



a consequence of the Lanczos tridiagonalization process inherent in the CG 
algorithm 19 . At step m, the Lanczos tridiagonalization produces a tridiagonal ma- 
trix T m by taking the QR factors of successive Krylov subspaces, Q. m R m := 
tC m (K, y), such that T m = Qj n KQ m 2& . The eigenvalues of T m , denoted by 9i(m), 
are called the Ritz values. The filtering properties of /, are controlled by the con- 
vergence of #™ to the nonzero eigenvalues of K, i.e. diag(£ 2 ) = {er 2 , . . . , c 2 }. This 
convergence, in turn, is controlled by the number of columns m spanned by the 
Krylov subspace. If 9i(m) has converged to of, then p m (Qi(m)) = and = 1. As 
m approaches n and if all of the Ritz values converge to their corresponding eigen- 
values of K, then all of the filter factors will approach one, and the PLS solution 
will coincide with the OLS solution. However, in practice, as m increases, one often 
gets multiple copies for large values of 9i(m) (ghost values) due to finite- precision 
arithmetic 28 . Hence, the Ritz values disproportionally converge to the largest eigen- 
values of K as m increases. This results in a delayed or lack of convergence of 9i(m) 
toward of. Therefore, the PLS solution generally does not coincide in real computa- 
tions with the OLS solution for m = n since convergence to the smallest eigenvalues 
of K is rarely achieved. As a result of the delayed convergence, an optimal or near 
optimal m often satisfies m <C n for ill-conditioned coefficient matrices K — see 



3.3.4. Nonstandard filter factor representations 

Filter factor representations can easily accommodate non-standard filtering pro- 
cedures. Standard filtering procedures typically assume that, in the SVD expan- 
sion, the terms corresponding to the largest singular values are the most impor- 
tant and should be kept, while the terms corresponding to the smallest singular 
values should be filtered out or damped. This not need be the case with gene 
expression data. In Alter et al. 3 , the largest singular value and its correspond- 
ing right and left singular vectors were removed in order to filter out the steady 
state of yeast cell cycle expression. Bair et al. 6 introduced a mixture model for 
outcome to accommodate other sources of non-outcome variation in which the 
outcome class label vector y was expressed as a linear combination of the first 
two singular vectors: y ps au\ + (1 — a)u 2 + e„, a e [0,1], where e„ is an 
n x 1 error vector in which each component value is Gaussian with unit vari- 
ance. When a = 1 and a = 0, y is correlated with U\ and U2, respectively, since 



[3 = UFT, 2 U T y ps [fi/af]aui + [/ 2 /cr 2 ](l — a)u 2 . Hence, for a = 1, we have 
fi ps 1 and fijki ps 0, while for a = 1, we have, f 2 ~ 1 and ps 0. 




(13) 



Fig.(l). 
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3.4. Extent of regularization 

For TSVD, KRR and PLS, one must decide upon the value of k (the number of sin- 
gular values and vectors kept in the SVD expansion of Eq.(4)). A (the multiplicative 
factor of the identity matrix added to K in Eq.(6)) and m (the number of columns 
spanned by the Krylov subspace K, m ) using information obtained from the training 
data only. The choice of the regularization parameter dictates the values of the filter 
factors which, in turn, control the amount of smoothing the solution (3 will undergo. 
For TSVD and PLS, the regularization parameters k and m are the integer- valued 
entities {1, 2 . . . , n}, and the smaller the regularization parameter, the greater the 
amount of smoothing. A regularization parameter of n effectively amounts to an 
absence of regularization since fi « 0(1) for any i £ {1,2, ... ,n}. For KRR, how- 
ever, the regularization parameter A is nonnegativc and real-valued — the larger the 
value, the greater the amount of smoothing and vice-versa. In particular, the effec- 
tive range of regularization for A lies in the interval [0, o~{\ since A = corresponds 
to no regularization (/i = • • • = /„ = 1) while A = <j\ roughly amounts to keeping 
the first term of the SVD expansion (/i = 1/2 « 0(1) and fi f=a O(e) for i > 2 and 
0<6«1). Note that for TSVD and KRR, the filter factors lie within the interval 
[0, 1]; see Fig.(l). From the point of view of regularization, this makes perfect sense 
since we either want to keep an SVD expansion term (/, m 1) or suppress an SVD 
expansion term (/,; w 0). However, the filter factors for PLS can lie outside of [0, 1] 
since fi = 1 — Pm{o~f) is a polynomial of degree m and can assume large positive 
and negative values as p m (0) oscillates; see Fig.(l). Hence, the regularization effects 
of PLS can be difficult to interpret. Additional details of the regularizing effects of 
the CG algorithm in the context of PLS can found in the work of Lingjaerde and 
Christopherson 25 . 

3.5. Choice of filter factors and ill-posedness 

The choice of filter factors should be informed by the type of numerical ill-posedness 
exhibited by X . There are generally two types of numerical ill-posedness in the 
case of linear systems: rank- deficiency and discrete ill-posedness 19 . Rank-deficiency 
is characterized by a large gap between o~k and o~k+i in the singular value spectrum 
of X in which the last n — k singular values are assumed to reflect spurious noise 
inherent in the data. In this case, r eff = k and the preferred treatment for handling 
such ill-posedness is the TSVD scheme: truncate the last n — k terms in the SVD 
expansion. On the other hand, discrete ill-posedness is characterized by a slow decay 
of the singular values in which there is no well-determined gap in the singular value 
spectrum. For such problems, truncation of the SVD expansion may not lead to the 
best-regularized solution. If we truncate too early, then we may lose information, 
and if we include too many terms, then the solution can become unstable in the 
presence of noise. As a result, one may want to compromise by reweighting all of the 
terms such that terms with small singular values are damped to a greater degree 
than terms with larger singular values. In this case, the preferred remedy would 
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Fig. 1. Comparison of various filter factors representations for the training data in the normalized 
UNM Infant ALL/AML data set (see §4). The filter factors /; (y-axes) are plotted against 
the summation index i (x-axes). The regularization parameters for TSVD, KRR and PLS are 
k = {5, 36} (the truncation parameters), A £ {(75, 0-35} (the scalar multiples of the identity matrix, 
In) and m G {5, 36} (the number of columns spanning the Krylov subspace, K, m {K, %/))■ 



TSVD KRR PLS 




be either KRR or PLS. In this study, we are interested in whether the type of 
numerical ill-posedness exhibited by the cancer gene expression data should dictate 
the type of regularization scheme used and whether this leads to improvement in 
overall classification performance. We now discuss our numerical results for several 
cancer gene expression data sets. 

4. Gene Expression Data Sets and Preprocessing 
4.1. Gene expression data sets 

Over the past several years, there have been a number of human gene expression 
studies that have considered the problem of classification with respect to tumor 
lineage and outcome. This paper focuses on the following data sets: 
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• UNM Infant ALL/AML 30 : In this study, 126 infant (< 1 year) sam- 
ples were grouped according to their major precursor origins within acute 
leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia 
(AML). 

• MIT- ALL/ AML 15 : In this study, 72 patient samples were grouped ac- 
cording to ALL or AML. 

• UNM Pediatric ALL Outcome 29 : 254 pediatric patient samples (< 
18 years of age) with ALL were examined with respect to outcome. This 
data set includes distinct tumor sublineages (34 T-cell ALL and 220 B-cell 
ALL patient samples) and multiple molecular subtypes (the chromosomal 
aberrations that give rise to certain leukemias). 

• van't Veer Breast Cancer Outcome 32 : This data set concerns primary 
breast cancer outcome (patients who remained metastasis-free for at least 
five years versus those patients who developed distant metastases within 
five years) in a group of 95 patients selected for age (< 55 years) and a 
clinical indication of favorable prognosis (i.e. lymph node negative status 
and a tumor diameter less than 5 centimeters). 

The class distribution of the patient samples across the cancer gene expression 
data sets can be found in Table 1. The UNM Infant ALL/AML and the UNM 
Pediatric ALL Outcome data sets used the Affymetrix MAS 5.0 software 1 to 
generate expression levels for 12625 genes (actually cDNA probescts), while for the 
MIT ALL/AML data set, the Affymetrix MAS 4.0 software was used to generate 
expression values for 7129 genes. For the van't Veer Breast Cancer Outcome 
data set, spotted gene arrays (Hu25K microarrays) containing 24,481 genes were 
used. 



Table 1. Class distribution of patient samples across data sets. 



Data Set 


Training Set 


Test Set 


UNM Infant ALL/AML 


54 ALL; 35 AML 


24 ALL; 13 AML 


MIT ALL/AML 


27 ALL; 11 AML 


20 ALL; 14 AML 


UNM Pediatric ALL Outcome 


73 REM; 94 FAIL 


39 REM; 48 FAIL 


van't Veer Breast Cancer Outcome 


44 REM; 34 FAIL 


7 REM; 12 FAIL 



These gene array data sets were chosen since we wanted to consider class pre- 
diction tasks based upon tumor lineage and outcome. In approximate terms, tumor 
lineage and outcome are opposites with respect to the observed biological variability. 
We expect tumor lineage to have a higher signal-to-noise ratio than outcome since 
tumor lineage dominates the fate of cancer cells. Conversely, outcome embodies the 
full biological, genomic and environmental heterogeneity of an individual, making 
it difficult to isolate a unique or universal favorable or unfavorable signature within 
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the biological noise. Since hyperplane classifiers differ in how they stabilize the solu- 
tion in the presence of noise, certain hyperplane classifiers may to be better-suited 
to handling class prediction tasks that differ in signal-to-noise ratios. 

4.2. Gene selection 

In cancer gene expression studies, not all genes change substantially in response 
to disease. As a result, a filtering procedure is often used to reduce the total of 
number of genes, d, to a core subset in which the reduced number of genes, denoted 
by dl ', are differentially expressed with respect to the class distinction of interest. 
Here, a two-stage filtering procedure is used. First, genes are removed on the basis 
of qualitative measures of gene detection. Second, highly discriminating genes are 
chosen using a ranking procedure based upon weight- vector component magnitudes. 

For the data sets using the Affymetrix platform, all control genes/probesets 
were removed, i.e. all probesets having the "AFFX" prefix in the probeset ID 1 . 
In addition, any probeset that did not have at least one "Present" call value (as 
determined by the Affymetrix MAS 5.0 statistical software) within the training 
set samples was also removed. A call value is an Affymetrix-specific qualitative 
measure of detection: it indicates whether a gene was expressed ("present"), was 
not expressed ( "absent" ) or was too close to call ( "marginal" ) . This typically reduces 
the initial number of probesets by a third or a quarter. For the van't Veer Breast 
Cancer Outcome data set (the only non-Affymetrix-based data set), two patients 
(samples 53 and 54) had a significant number of missing values and these two 
samples were excluded from subsequent analysis (this was not done in the study 
of van't Veer et al. 32 ). A gene was then removed if there was at least one missing 
expression value among all of the training set samples. Note: If one of the genes in 
the test set had a missing value, then the missing value for that gene was replaced 
with the mean of the normalized expression values for that gene in the training set; 
see §4.3 for data normalization details. 

For the second stage of the gene selection process, a modified version of the 
Recursive Feature Elimination (RFE) algorithm 18 was used. In the RFE algorithm, 
an SVM (or any other hyperplane classifier for that matter) is trained using all 
of the genes from patient samples in the training set. The genes are then ranked 
according to their largest weight-component magnitudes, 

{\w h \,...,\w id \}, 

where {ii, . . . , i c i} are the sorting indices such that \wi t | and \uii d | are the largest and 
smallest weight-component magnitudes, respectively. The genes associated with the 
smallest weight-component magnitudes are removed and the SVM is then retrained 
using the remaining genes. This process is repeated iteratively until there are no 
more genes left. Here, we use KRR to generate the weight vector per iteration: 
^(fe) _ xp(k) wnere / g( fc ) ig t ne solution to the linear system, 

(ir« + A( fe )j„)/3« =y, 
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k is the RFE iteration and K 1 ' is the kernel matrix obtained using only the gene 
subset obtained at iteration k. In the previous work of Guyon et aL 18 , the reg- 
ularization parameter A was fixed across all gene subsets. This strategy tends to 
under-regularize (over-regularize) l<c" when k is small (large). Here, we modify 
the regularization parameter A^- 1 to be the mean of the singular values of 
(or cquivalcntly, the mean of the trace of K^). This allows for fast computa- 
tion (no SVD of is required) and consistent filtering across RFE iterations. 
This filtering also intentionally errs on the side of over-regularization since A« is 
weighted toward the largest singular values (otherwise, if \( ' is too small, then we 
will under-regularize and ovcrfit the training data). 

In this study, the number of genes used in the hypcrplane classifier is varied 
amongst the values in the set: d! = {cZ (all genes), 100, 50, 25, 10}. Our interest is 
not in finding the optimal subset of discriminating genes, but in measuring qualita- 
tive changes of class prediction performance as d! is decreased. Removing spurious 
"gene noise" is a common and valid reason given for using gene selection. However, 
there are other more practical reasons for doing so. Computationally, gene selection 
greatly eases the numerical burden in, say, computing the SVD of X. Clinically, 
assays that require relatively few gene expression levels to be measured are more 
likely to be adopted in a real- world setting. 

4.3. Data normalization 

Expression values corresponding to different genes can differ significantly in magni- 
tude. As a result, one often tries to transform and scale the data such that the ex- 
pression values across patients for a given gene are of the same size. First, the expres- 
sion data was Zog-transformed using the transformation, ccy <— sgn{xij)log2(\xij |), 
since the data tends to exhibit a log-normal distribution for the positive values for 
a fixed gene (note that for the van't veer Breast Cancer Outcome data set, 
the publically-available data was already log-transformed). Second, the expression 
values for a fixed gene across the training set samples were normalized to mean 
zero and unit standard deviation (the expression values for a fixed gene within the 
test set were normalized using the mean and standard deviation from the training 
set). Third, to avoid scaling problems when solving Kf3 = y, both K and y were 
re-scaled so that their component elements were of commensurate size. Following 
the recommendation of Varah 31 , the left- and right-hand sides of K(3 = y were 
rc-scalcd such that \\K\\2 = a\ w 0(1) ~ Hylll- This was easily accomplished using 
the following re-scaling: £ <— E/oi and y <— y/y/n. 

4.4. Model Selection 

iV-fold cross-validation was used to estimate the value of the regularization param- 
eter. In this study, we used N = 10. For TSVD and PLS, the candidate values used 
for k and m, respectively, came from the set {!,..., n}. The candidate values for 
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A came from the set of singular values (A € {o~i, ■ ■ ■ , c„}) since, for Tikhonov and 
Franklin regularization, the effective range for A lies in the interval [0,<7i] 19 . Due 
to the re-scaling of the data to obtain \\K\\\ = o\ = 1, the set of candidate values 
for A lie in the interval [0, 1]. Note that in this study, gene selection is performed 
separately on each cross-validation fold. Failure to do so will induce a gene selection 
bias that yields overly optimistic cross-validation error rates 4 . 

5. Results 

5.1. Software 

The software for the regularized least squares hyperplane classifiers was written 
in MATLAB. The MATLAB Regularization Toolbox 20 was used to calculate the 
CG-based filter factors for the PLS algorithm. 

5.2. Classification performance 

5.2.1. Performance averaged across many partitions 

In gene expression studies, a single partition of the training and test set is often used 
for training and validation. This specific partition is primarily chosen on the basis of 
case-control considerations in order to mitigate any bias that may arise from the way 
the patient samples were collected. However, drawing conclusions about classifier 
performance on the basis of a single training and test set partition can be misleading, 
especially when the sample size n is small relative to the number of dimensions 
d. To avoid this dilemma, we measure classifier performance, on average, across 
many training and test set partitions. To maintain consistency across partitions, 
the number of positive and negative samples in the training and test set are kept 
the same. Overall classification performance was measured as an average across 
1000 training and test set partitions. The average number of misclassifications was 
computed as M = (J2p=i M?>)/1000 in which M p is computed as in Eq.(l) for the 
p th training or test set partition. 

5.3. Trends in overall classification performance 

For each data set, Fig. (2) displays the average number of misclassifications (and cor- 
responding error bars given by the standard deviation) for four different hyperplane 
classifiers and five distinct feature-set choices (number of genes). Three out of the 
four classifiers correspond to the standard least squares techniques (TSVD, KRR 
and PLS) and the fourth classifer corresponds to the two-norm SVM formulation 
of Eq.(ll), hereafter denoted as SVM2. The average number of misclassifications 
for each data set is compared to the number of misclassifications that would be ob- 
tained by majority class prediction (MCP), where a prediction for a new sample is 
determined solely by siding with the class in the test set that has the most members 
(this is not known a priori). 
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Fig. 2. For a given data set, the average number of test set misclassifications (y-axis) was plotted 
as a function of the number of genes (x-axis). Different colors correspond to different hypcrplanc 
classifiers: black for SVM2, the two-norm SVM of Eq.(ll), and red, blue and green for the SVD- 
based filter factor representations of TSVD, KRR and PLS, respectively. The squares correspond 
to the average number of misclassifications and the vertical bars indicate the standard deviation. 
The numbers in parentheses next to the plot title denote the number of misclassifcations that 
would be obtained by majority class prediction (MCP) and the total number of test samples used 
for that specific data set, respectively. 
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In Fig. (2), the subplots in rows one and two correspond to the data sets asso- 
ciated with tumor lineage and outcome, respectively. As expected, class prediction 
tasks related to tumor lineage are 'easy' (as evidenced by the low number of misclas- 
sification and small standard deviations) since tumor lineage dominates the expres- 
sion behavior. For outcome, however, the results were only slightly better than what 
would be obtained by MCP. On average, the choice of regularization scheme did 
not impact class prediction performance. The simplest of the filter factor strategies, 
the TSVD scheme, sufficed in most instances. 

Somewhat surprisingly, gene selection did not improve performance. In fact, 
gene selection slightly degraded performance for outcome prediction as the number 
of genes was decreased. For tumor lineage, only KRR and SVM2 slightly benefited 
from gene selection for d! < 100. This is not too surprising since KRR and SVM2 
share the same regularization mechanism, i.e. the Franklin regularization scheme. 
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Particularly striking was the poor performance of the least squares classifiers relative 
to SVM 2 for d! = 100 in the UNM Infant ALL/AML data set. In this instance, 
the average number of support vectors was 83 patient samples out of a total of 89 
training set samples. Since every patient expression vector is a support vector for the 
regularized least squares classifiers, "outlier" patient expressions vectors (perhaps 
the 6 non-support vectors not used by SVM2) can possibly tilt the weight vector 
w in such a way that does not generalize well to new data points. Aside from this 
one instance, the regularized least squares classifiers, compared against SVM2, were 
competitive in terms of prediction accuracy, considerably simpler to implement, and 
faster in terms of execution time. Since TSVD sufficed in most instances, we will 
now examine its behavior in detail. 

5.4. Spectral coefficients 

The comparable classification performance of TSVD relative to its other, more 
complicated, filter factor rivals suggests that the type of numerical ill-posedness 
exhibited by the LS problems derived from the cancer expression data sets in this 
study tends towards rank-deficiency as opposed to discrete ill-posedness. However, 
classification performance alone is not the only benchmark measure that can be used 
to assess whether LS problems, derived from cancer expression data, tend toward 
rank-deficiency in general. The spectral coefficients in Eq.(5) can also be used to 
indicate the type of numerical ill-posedness. The spectral coefficients consist of 
the singular values ({<7i, . . . , c„}) and the Fourier coefficients {{\uf y\, . . . , |M T y|}). 
Both of these coefficients decrease in value as i increases and, by examining their 
rates of decay, they shed insight into the type of numerical ill-posedness of the 
LS problem. In addition, one can use the decay behavior of these coefficients in 
estimating the truncation parameter for TSVD. The utility of these coefficients in 
determining the type of numerical ill-posedness and the TSVD truncation parameter 
for hyperplane classification of cancer gene expression data is discussed next. 

5.4.1. Gaps in the singular values 

As mentioned previously in Section 3.5, rank-deficiency is characterized by a large 
gap between singular values o~k and c^+i such that /j = 1 for i G {1, . . . , fc} and 
fi = otherwise. In such a case, a general rule of thumb would be that the index of 
where the gap occurs (i.e. k) corresponds to the truncation parameter for TSVD. 
We now want to see if there is a large gap in the averaged singular value spectrum 
across all training set partitions such that ct,; = {J2 P =i ( a i ) 2 )/1000, i = 1, . . . ,n. 
In Fig. (3) (top row), we have plotted the singular values for varying numbers of 
genes in each of the four datasets. The averaged rate of decay when using all genes 
(the solid black line), with the exception of the UNM Pediatric ALL Outcome 
data set, is faster than when using fewer genes. This is to be expected since gene 
selection should, in principle, remove gene noise by keeping only the most class- 
specific genes. For the UNM Pediatric ALL Outcome data set the opposite 
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is true: the rate of decay decreases as d! decreases. What could cause this rank 
inflation as d! decreases in this instance? One possible answer to this question lies 
in the heterogeneity of this data set. The UNM Pediatric ALL Outcome data 
set can be grouped according to the two major lymphocyte subtypes: B-cells (34 
samples) and T-cells (220 samples). They may also be characterized by molecular 
subtype or chromosomal abnormality (30 t(12;21), 30 t(l;19), 14 t(9;22), 20 t(4;ll) 
and 29 hyperdiploid samples, with 131 'other' molecular subtype samples). Since 
tumor lineage and chromosomal abnormalities determine the fate of cancer cells, 
these class distinctions dominate the observed expression behavior in terms of signal 
recovery. Classifying expression on the basis of these class distinctions is expected 
be relatively easy since the gene expression profiles corresponding to different cell 
or chromosomal subtypes have relatively large signal-to-noise ratios. Indeed, this 
is borne out in tests 23 ' 33 . Hence, if one is interested in outcome signal recovery, 
then one has to contend with the relative weakness of the outcome signal compared 
to the dominant biological signals of tumor sublineages and molecular subtypes. 
Furthermore, the gene selection process for outcome may be confounded since the 
signal-to-noise ratio is low: in essence, one is attempting to find outcome-specific 
genes by training on noise. Unfortunately, we believe that this hypothesis cannot be 
adequately addressed using real expression data. Thus, in order to assess whether 
gene selection techniques actually find the class-specific genes of interest when there 
are many competing sources of biological variability, a simulation approach will most 
likely be necessary. 



5.4.2. Fourier coefficients 

For the four gene expression data sets examined, the decay of the Fourier coefficients 
proved more illuminating than the singular value decay in terms of determining the 
type of ill-conditioning. In the presence of noise, the Fourier coefficients \ufy\ will 
level off at a base-line level of background noise determined by the errors in the 
coefficient matrix K for 1 < io < i < n 19 . As a result, the SVD expansion terms 
for i > io should, in principle, be filtered out since these terms are contaminated 
by noise and will dominate the OLS solution. Under this criterion, io can be used 
as a proxy for the TSVD truncation parameter, i.e. we set k « i . 

The second row of Fig. (3) displays the averaged Fourier coefficients, c, = 
(Ep=i l(«D T 2/ P l)/ 1000 > i = l,-..,n. Note that as the number of genes decreases 
from 6! = d to 6! = {100, 50, 25, 10} across all four data sets, there is a qualitative 
change in the decay of Ci. When d! = rf, the decay of c\ (given by the solid black 
line) slowly oscillates toward zero and z ~ r = min{n — 1, d'} (r = n — 1 or nearly 
full rank when n < d and r = d when n > d). This indicates that all terms in the 
SVD expansion should be kept and the OLS solution would suffice in effectively 
classifying the expression data. This was experimentally borne out by examining 
the average number of misclassifications as a function of the truncation parameter, 
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Fig. 3. For a given data set, the y-axes denote averaged singular values (first row), averaged Fourier 
coefficients (second row) and averaged number of misclassifications for the TSVD method (third 
row). The x-axes denote the summation index i or the truncation parameter k in the case of the 
averaged number of misclassifications. Different colors correspond to the different numbers of genes 
used in the analysis (see legend). 
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denoted by M k = (£p=i M£)/1000, where 

N I 
3 = 1 V 

is the number of misclassification per truncation parameter k and partition p. In 
the third row of Fig. (3), M k is plotted against k. When k sa r and d! = d, the 
minimum value or a near minimum value for M k was obtained for tumor lineage 
and outcome, respectively. 

When using fewer genes (d' < d), the decay of Cj (the colored lines) descends 
quickly to such that io ~ 1 (the exception was the UNM Pediatric ALL 
Outcome data set). In this case, io ~ 1 implies that a low-rank approximation 
for K suffices for effective classification. Again, this was experimentally borne out 
since k « 1 was the truncation parameter that minimized Mk (in good agreement 
with the effective resolution limit of io w 1). Notice that for the UNM Pediatric 
ALL Outcome data set, the Fourier coefficient decay was not as rapid as with 
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the other data sets (i.e. io 3> 1). Again, there was good experimental agreement 
between the value of io and the truncation parameter (k 3> 1) that minimized Mfc 
in the third row of Fig. (3). While io does appear to be a reasonable proxy for a 
near-optimal TSVD truncation parameter, caution is still advised in the case of 
outcome prediction since the best misclassification results were, on average, barely 
below MCP. 

6. Conclusion and Future Work 

In this study, we have shown that many popular least squares techniques used 
for linear discrimination can easily be united under the framework of an SVD- 
based filter factor representation. Using the philosophy of Occam's razor as a guide, 
and based upon the four data sets examined, the TSVD regularization scheme 
emerged as the preferred hyperplanc regularization strategy since its performance 
was comparable, on average, to SVM2 and to the other more complicated filter 
factor strategies. The classification performance of TSVD, coupled with the decay 
behavior of the Fourier coefficients in Fig. (3), lends credence to the observation 
that linear inverse problems associated with the hyperplane classification of gene 
expression data tend toward rank-deficiency as opposed to discrete ill-posedness, 
especially when using fewer genes. The spectral coefficients, in particular the Fourier 
coefficients, act as useful numerical diagnostics, indicating when signal recovery for 
outcome or tumor subclass is possible. 

Gene selection did not necessarily lower the number of misclassifications. In- 
deed, for outcome, the best performance was often achieved using all d genes. One 
possible explanation is that the genes responsible for distinguishing differences in 
outcome are dominated by a significant number of differentially expressed genes re- 
sponsible for the major sources of upstream biological variation, e.g. tumor subclass 
distinctions or molecular subtype in the case of leukemia. If this is the case, signal 
recovery for outcome is likely to be severely hampered because the major sources 
of variation (biological and experimental) can inflate the singular value spectrum 
to such an extent that the only signals that remain above the noise background are 
those which dominate the expression behavior. The testing of this hypothesis will 
require extensive gene array simulations and the modeling of cancer gene expression 
as a superposition of tumor subclass, outcome and experimental set effect signals, 
with class-specific sets of genes responsible for distinguishing class subtypes within 
each signal. This is a subject for future study. Alternatively, one can restrict the 
patient sample size to create data subsets that are less heterogeneous in terms of 
dominant sources of variation, i. e. gene selection within samples restricted to B-cell 
or T-cell acute lymphoblastic leukemias only. However, one then pays the price of 
working with data sets that are of extremely small size. In general, gene selection, 
by preserving the most class-specific genes, allows for very aggressive regularization 
by making the kernel matrix more rank-deficient — as indicated by the rapid decay 
of the Fourier coefficients from d' = {all genes} to d 1 < 100 genes. This results in 
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the ability to construct low-rank approximations of the kernel matrix K. 

The main reason for the success of SVD as an analysis tool is that it pro- 
vides a new coordinate system in which the coefficient matrix K becomes diagonal 
(i.e. X = diag(cri, . . . , <7„)). As a consequence, SVD is rank-revealing: the largest sin- 
gular values and vectors capture the relevant solution information by eliminating 
or downwcighting noise terms (using filter factors) in the SVD expansion associ- 
ated with the smallest singular values. However, there are alternatives to the SVD, 
i. e. there exist other rank-revealing matrix factorization techniques such as rank- 
revealing QR or UTV factorizations 19 that can also reliably solve rank-deficient 
LS problems. Since TSVD works well with regularized LS techniques in classify- 
ing gene expression data, it is likely that these alternative matrix factorizations will 
work just as well. Moreover, and unlike SVD, these alternative matrix factorizations 
permit for the efficient updating of rows and columns when they are appended or 
deleted from K 19 , a common scenario in gene expression studies since patients and 
genes are routinely added or deleted from analysis as new clinical and experimen- 
tal information becomes available. Investigations of such alternative factorization 
approaches in hyperplane classifers is currently in progress. 
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